    Energy-Delay Tradeoff Analysis of ILP-based Compilation Techniques on a VLIW Architecture

    Energy consumption is becoming an important issue in modern processors, especially in embedded systems. While many architectural solutions for reducing energy consumption exist, software solutions mostly rely on performance optimization techniques. The rationale behind the latter approach is that energy consumption is roughly proportional to execution time. While some ILP techniques increase performance by eliminating redundant instructions, others increase the total instruction count, mitigating their benefit as far as energy consumption is concerned. This paper explores the energy-delay tradeoff of ILP-enhancing techniques at the compilation level. The goal is to develop a theoretical understanding of the main energy issues involved in forming ILP blocks. We present an analytical methodology that exploits the variations in program performance to identify conditions leading to increased energy consumption. Our results show that there exists a threshold above which ILP-enhancing optimizations yield diminishing energy-reduction returns. The proposed tradeoff analysis reveals that this is mainly attributable to the limited instruction parallelism available in applications: when ILP is pushed above a given threshold, wasted computation and, to some extent, machine overhead start to dominate energy consumption.
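
    The threshold argument can be made concrete with a toy model. Here is a minimal sketch, assuming energy splits into a per-instruction component and a static per-cycle component, and that pushing issue width past the application's inherent ILP adds wasted instructions; all constants are illustrative, not taken from the paper:

        /* Toy energy-delay model for ILP block formation.
         * All constants are illustrative assumptions, not values from the paper. */
        #include <stdio.h>

        int main(void) {
            const double e_per_insn = 1.0;    /* energy per executed instruction (arbitrary units) */
            const double p_static   = 2.0;    /* static/overhead power per cycle */
            const double n_useful   = 1000;   /* useful instructions in the region */

            for (int ilp = 1; ilp <= 8; ilp++) {
                /* Assumption: pushing issue width past the app's inherent ILP
                 * (~4 here) adds wasted speculative/compensation instructions. */
                double wasted = n_useful * 0.10 * (ilp > 4 ? ilp - 4 : 0);
                double cycles = (n_useful + wasted) / ilp;
                double energy = (n_useful + wasted) * e_per_insn + cycles * p_static;
                printf("ILP=%d  cycles=%7.1f  energy=%7.1f  EDP=%9.1f\n",
                       ilp, cycles, energy, energy * cycles);
            }
            return 0;
        }

    In this model, energy falls as ILP grows up to the inherent-parallelism threshold (4), then rises again as wasted instructions dominate, which is the shape of tradeoff the paper analyzes.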

    A Case for a Complexity-Effective, Width-partitioned Microarchitecture

    Current superscalar processors feature 64-bit datapaths to execute program instructions, regardless of operand size. Our analysis indicates, however, that most executions comprise a large fraction (40%) of narrow-width operations, i.e., instructions that exclusively process narrow-width operands and results. We further noticed that these operations are well distributed across a program run. In this paper, we exploit these properties to master the hardware complexity of superscalar processors. We propose a width-partitioned microarchitecture (WPM) that decouples the treatment of narrow-width operations from that of the other program instructions. We split a 4-way issue processor into two clusters: one executing 64-bit, load/store, and complex operations, and the other treating 16-bit operations. We show that revealing the narrow-width operations to the hardware is sufficient to keep the workload balanced and communication between clusters minimal. Using a WPM reduces the complexity of several critical processor components, such as the register file and the bypass network. A WPM also lowers the complexity of the interconnection fabric, since the 16-bit cluster only propagates narrow-width data. We examine simple WPM configurations and discuss their tradeoffs, and we evaluate a speculative heuristic for steering narrow-width operations toward the clusters. A detailed complexity analysis shows that the WPM model saves power and area with minimal impact on performance.
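
    The classification the paper builds on is simple to state: an operation is narrow-width when its operands and result all fit in 16 bits. A minimal sketch of that check and the steering decision it implies follows; note that the paper's actual heuristic is speculative (it predicts widths before execution), so this after-the-fact check is only an illustration:

        /* Sketch: classify an operation as narrow-width when its source operands
         * and result all fit in 16 bits, and steer it to the 16-bit cluster.
         * The steering rule here is illustrative, not the paper's heuristic. */
        #include <stdint.h>
        #include <stdio.h>

        static int fits16(uint64_t v) { return v <= 0xFFFFu; }

        /* returns 0 for the 64-bit cluster, 1 for the 16-bit cluster */
        static int steer(uint64_t a, uint64_t b, uint64_t result) {
            return fits16(a) && fits16(b) && fits16(result);
        }

        int main(void) {
            printf("%d\n", steer(12, 300, 312));     /* 1: narrow -> 16-bit cluster */
            printf("%d\n", steer(1u << 20, 3, 4));   /* 0: wide   -> 64-bit cluster */
            return 0;
        }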

    LASER: Light, Accurate Sharing dEtection and Repair

    Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and contention can even arise invisibly due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime environment. This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler, or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec, and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.
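
    For illustration, the sketch below shows the textbook false-sharing pattern, two per-thread counters on one cache line, together with the conventional source-level repair of padding each counter onto its own line. LASER itself repairs contention online, without this kind of program change:

        /* Classic false-sharing layout and its conventional padding repair.
         * Layout sketch only; LASER detects and repairs this at run time. */
        #include <stdint.h>
        #include <stdio.h>

        #define CACHE_LINE 64

        struct counters_shared {      /* both fields on one cache line: false sharing */
            uint64_t a;               /* written by thread 0 */
            uint64_t b;               /* written by thread 1 */
        };

        struct counters_padded {      /* repaired: one counter per cache line */
            _Alignas(CACHE_LINE) uint64_t a;
            _Alignas(CACHE_LINE) uint64_t b;
        };

        int main(void) {
            printf("shared: %zu bytes, padded: %zu bytes\n",
                   sizeof(struct counters_shared), sizeof(struct counters_padded));
            return 0;
        }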

    MPreplay: Architecture Support for Deterministic Replay of Message Passing Programs on Message Passing Many-Core Processors

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems Laboratory.

    Randomness Concerns When Deploying Differential Privacy

    The U.S. Census Bureau is using differential privacy (DP) to protect confidential respondent data collected for the 2020 Decennial Census of Population & Housing. The Census Bureau's DP system is implemented in the Disclosure Avoidance System (DAS) and requires a source of random numbers. We estimate that the 2020 Census will require roughly 90TB of random bytes to protect the person and household tables. Although there are critical differences between cryptography and DP, they have similar requirements for randomness. We review the history of random number generation on deterministic computers, including von Neumann's "middle-square" method, the Mersenne Twister (MT19937) (previously the default NumPy random number generator, which we conclude is unacceptable for use in production privacy-preserving systems), and the Linux /dev/urandom device. We also review hardware random number generator schemes, including the use of so-called "Lava Lamps" and the Intel Secure Key RDRAND instruction. We finally present our plan for generating random bits in the Amazon Web Services (AWS) environment using AES-CTR-DRBG seeded by mixing bits from /dev/urandom and the Intel Secure Key RDSEED instruction, a compromise among our desire to rely on a trusted hardware implementation, the unease of our external reviewers in trusting a hardware-only implementation, and the need to generate so many random bits.
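
    A minimal sketch of one way the described seed mixing could look, combining bytes from /dev/urandom with the RDSEED instruction via XOR. The XOR combiner and the 256-bit seed size are assumptions for illustration, not the DAS implementation:

        /* Sketch: mix seed material from /dev/urandom and RDSEED for a DRBG.
         * XOR is shown as one simple combiner; the actual DAS mixing may differ. */
        #include <immintrin.h>   /* _rdseed64_step; compile with -mrdseed */
        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            uint64_t seed[4] = {0};  /* 256 bits of seed material (assumed size) */

            /* Entropy source 1: the kernel's pool. */
            FILE *f = fopen("/dev/urandom", "rb");
            if (!f || fread(seed, sizeof(seed), 1, f) != 1) return 1;
            fclose(f);

            /* Entropy source 2: the on-chip hardware source, XOR-mixed in.
             * _rdseed64_step can transiently fail, so retry until it succeeds. */
            for (int i = 0; i < 4; i++) {
                unsigned long long hw;
                while (!_rdseed64_step(&hw)) /* retry */;
                seed[i] ^= hw;
            }

            for (int i = 0; i < 4; i++)
                printf("%016llx\n", (unsigned long long)seed[i]);
            return 0;   /* seed[] would then initialize AES-CTR-DRBG */
        }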

    Energy reduction potential of a phase-based cache resizing scheme for embedded systems

    Managing the energy-performance tradeoff has become a major challenge in embedded systems. The cache hierarchy is a typical design target where this tradeoff plays a central role. With increasing integration density, a cache can comprise several billion transistors and consume a large proportion of a processor's energy; at the same time, it considerably improves performance. Configurable caches are becoming the de facto solution for dealing with this problem efficiently. Such caches are equipped with mechanisms that allow them to be resized dynamically; in the context of embedded systems, however, many of these mechanisms restrict configurability to the application level. In this paper, we propose modifying the structure of a configurable cache to give embedded compilers the opportunity to reconfigure it according to a program's dynamic phases, rather than on a per-application basis. Our experimental results show that the proposed scheme can improve the compiler's effectiveness at reducing energy consumption without excessively degrading performance.
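
    A toy model of the idea: if the cache exposes a resizing knob (way masking is one common mechanism, assumed here), a compiler-inserted call at a phase boundary can shrink the cache when a phase has a small working set, cutting per-access energy. The interface and numbers below are hypothetical:

        /* Toy model of a resizable set-associative cache: a compiler-emitted
         * directive disables ways at a phase boundary, trading hit rate for
         * the energy of probing fewer ways. Interface and numbers are illustrative. */
        #include <stdio.h>

        #define MAX_WAYS 8

        static int active_ways = MAX_WAYS;   /* resizable at run time */

        /* hypothetical reconfiguration hook the compiler would emit at a phase change */
        static void cache_set_ways(int ways) {
            if (ways >= 1 && ways <= MAX_WAYS) active_ways = ways;
        }

        int main(void) {
            /* assume per-access energy grows with the number of ways probed */
            for (int w = MAX_WAYS; w >= 1; w /= 2) {
                cache_set_ways(w);
                printf("ways=%d  relative access energy=%.2f\n",
                       active_ways, (double)active_ways / MAX_WAYS);
            }
            return 0;
        }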

    Techniques de compilation pour la gestion et l'optimisation de la consommation d'énergie des architectures VLIW (Compilation techniques for managing and optimizing the energy consumption of VLIW architectures)

    This thesis focuses primarily on reducing the energy consumption of VLIW architectures while preserving as much performance as possible. Unlike some software approaches that favor code optimization to obtain energy savings, we argue for a synergistic approach integrating both hardware and software. The central idea defended throughout this thesis is that only an advanced understanding of a program's dynamic behavior at the compiler level can yield better control over energy management. To this end, we introduce a static analysis technique for a program's dynamic behavior, aimed at identifying and characterizing its most frequently executed paths. With energy reduction as the goal, we then show that, under compiler direction, the machine architecture can be modified to adapt to a particular dynamic state of the program. We present the conditions for such a reconfiguration as well as the architectural modifications it may require, both for the cache system and for the datapath of a VLIW processor.

    BugNet: Continuously recording program execution for deterministic replay debugging

    Significant time is spent by companies trying to reproduce and fix bugs that occur in released code. To assist developers, we propose the BugNet architecture to continuously record information on production runs. The information collected before a program crashes can be used by developers, working in their own execution environment, to deterministically replay the last several million instructions executed before the crash. BugNet is based on the insight that recording the register file contents at any point in time, and then recording the load values that occur after that point, enables deterministic replay of a program's execution. BugNet focuses on replaying the execution of the application and the libraries it uses, but not the operating system; even so, our approach can replay an application's execution across context switches and interrupts. BugNet therefore obviates the need for tracking program I/O, interrupts, and DMA transfers, which would otherwise require more complex hardware support. In addition, BugNet does not require a final core dump of the system state for replaying, which significantly reduces the amount of data that must be sent back to the developer.
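
    The core insight lends itself to a small sketch: after checkpointing the registers, logging every load value is enough, because loads are the only way external state (I/O, DMA, other processes) enters the execution. The log format and mem_load API below are invented for illustration:

        /* Sketch of BugNet's record/replay insight: log load values during
         * recording; during replay, feed values from the log instead of memory. */
        #include <stdint.h>
        #include <stdio.h>

        enum mode { RECORD, REPLAY };

        static uint64_t load_log[1024];
        static int log_len, log_pos;

        static uint64_t mem_load(enum mode m, const uint64_t *addr) {
            if (m == RECORD) return load_log[log_len++] = *addr;  /* log the value */
            return load_log[log_pos++];       /* replay: ignore memory, use the log */
        }

        int main(void) {
            uint64_t shared = 42;             /* stands in for externally visible memory */

            uint64_t r1 = mem_load(RECORD, &shared);
            shared = 99;                      /* e.g. another process changed memory */
            uint64_t r2 = mem_load(RECORD, &shared);

            /* replay reproduces the original values without touching memory */
            printf("recorded: %llu %llu\n", (unsigned long long)r1,
                   (unsigned long long)r2);
            printf("replayed: %llu %llu\n",
                   (unsigned long long)mem_load(REPLAY, &shared),
                   (unsigned long long)mem_load(REPLAY, &shared));
            return 0;
        }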